A data-driven crop model for maize yield prediction

Accurate crop yield prediction is of great importance for food security under the impact of climate change. We propose a data-driven crop model that combines the knowledge advantage of process-based modeling with the computational advantage of data-driven modeling. The proposed model tracks the daily biomass accumulation process during the maize growing season and uses the daily produced biomass to estimate the final grain yield. Computational studies using crop yield, field location, genotype and corresponding environmental data were conducted for the US Corn Belt region from 1981 to 2020. The results suggest that the proposed model can achieve accurate prediction performance, with a 7.16% relative root-mean-square error of average yield in 2020, and provide scientifically explainable results. The model also demonstrates an ability to detect and separate interactions between genotypic parameters and environmental variables. Additionally, this study demonstrates the potential value of the proposed model in helping farmers achieve higher yields by optimizing seed selection.


A Data-Driven Crop Model for Maize Yield Prediction
Supplementary Note 1: Input Data
Yield and Geographic Data:
• $TC$: set of county-year combinations $(t, c)$ for which observed yield data exist.
• $y_{t,c}$: observed corn yield in year $t$ in county $c$.
Weather Data: The dataset includes seven variables, six of which (all except snow water equivalent) are used in the model.
[…] Table (Valu1) in the database, the names and descriptions of which, taken from [1], are summarized as follows.
• $S^{aws}_{c}$: available water storage in county $c$, expressed in mm; the volume of plant available water that the soil can store in this layer based on all map unit components. This variable was measured by the average value of zone 5 (0-150 cm).
• $S^{tka}_{c}$: thickness of soil components in county $c$, expressed in cm, used for the available water storage calculation. This variable was measured by the average value of zone 5 (0-150 cm).
• $S^{soc}_{c}$: soil organic carbon stock estimate in county $c$, expressed in grams C per square meter to a depth of 5 cm. This variable was measured by the average value of zone 5 (0-150 cm).
• $S^{tks}_{c}$: thickness of soil components in county $c$, expressed in cm, used for the soil organic carbon calculation. This variable was measured by the average value of zone 5 (0-150 cm).
• $S^{nccpi3corn}_{c}$: national commodity crop productivity index for corn (weighted average) in county $c$. The values range from 0.01 (low productivity) to 0.99 (high productivity).
• $S^{pctearthmc}_{c}$: national commodity crop productivity index for major earthy components in county $c$, which are those soil series or higher level taxa components that can support crop growth.
• $S^{rootznemc}_{c}$: root zone depth in county $c$, expressed in mm; the depth within the soil profile from which commodity crop roots can effectively extract water and nutrients for growth.
• $S^{rootznaws}_{c}$: root zone available water storage estimate in county $c$, expressed in mm; the volume of plant available water that the soil can store within the root zone based on all map unit earthy major components.
• $S^{droughty}_{c}$: drought vulnerable landscapes in county $c$, which comprise those map units whose available water storage within the root zone for commodity crops is less than or equal to 6 inches (152 mm), expressed as "1" for a drought vulnerable soil landscape map unit or "0" for a non-droughty soil landscape map unit.
• $S^{pwsl1pomu}_{c}$: potential wetland soil landscapes (PWSL) in county $c$, expressed as the percentage of the map unit that meets the PWSL criteria.

Supplementary Note 2: Variables
Genotypic Parameters: a separate seed profile is extracted to represent the average characteristics of the genotypes grown in each county in each year. Although many genotypes are used in reality across different counties and regions, with inevitable overlap, the genotypic parameters describe an average profile of a genetic portfolio that is allowed to vary by county and evolve over time. Maize growth has two phases, vegetative and reproductive. The vegetative phase is characterized by three developmental timing epochs: planting, emergence, and peak of LAI. The reproductive phase is characterized by four: pollination, milk, dent, and maturity. Thus, we identified 7 developmental timing epochs, and the 6 subdivision periods between them, over the maize growth process. We use i ∈ {v1, v2, v3, r1, r2, r3} to represent these 6 subdivision periods of the vegetative and reproductive stages. The genotypic profile consists of eleven parameters defined over the 6 growth subdivisions: (1) growing degree days required to finish the growth stage, (2) proportion of biomass allocated to leaf, (3) proportion of biomass allocated to root, (4) proportion of biomass allocated to grain, (5) radiation use efficiency per leaf weight, (6) leaf size, (7) leaf senescence rate, (8) root water uptake rate, (9) transpiration efficiency coefficient, (10) leaf cover rate per leaf weight, which limits water evaporation from the soil, and (11) fraction of grain lost if waterlogging happens. Three additional parameters are defined for each year-county combination: (1) reference plant density, (2) base temperature for the GDD calculation, and (3) high cutoff temperature for the GDD calculation.
Soil Variables: underground soil has a water holding capacity, and excess water beyond this capacity drains at a certain rate. Following Robertson and Fukai [2], the data-driven crop model simulates a one-layer soil to track maize growth and yield. Soil water evaporates at a certain rate on a daily basis. A soil profile consists of seven parameters that define the aforementioned properties: (1) initial water level ratio, (2) soil waterlogging level, (3) soil water holding capacity, (4) drainage rate at which excess water drains, (5) precipitation level at which water runoff occurs, (6) water runoff rate, and (7) soil water evaporation rate.
• $s^{\text{initial water ratio}}_{t,c}$: initial water ratio of the soil in county $c$ and year $t$, expressed as a percentage; it is affected by rainfall after the harvest day of the previous year.
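The one-layer soil described above can be illustrated with a minimal daily water-balance sketch. The field names, the units, and in particular the order in which runoff, evaporation and drainage are applied are assumptions made for illustration; they are not taken from Equations (C.8)-(C.12).

```python
from dataclasses import dataclass


@dataclass
class SoilProfile:
    """Hypothetical container for the seven soil parameters (names are illustrative)."""
    initial_water_ratio: float  # fraction of holding capacity filled at d = 0
    waterlogging_level: float   # mm; soil counts as waterlogged above this level
    holding_capacity: float     # mm of water the soil can hold
    drainage_rate: float        # fraction of excess water drained per day
    runoff_threshold: float     # mm of daily precipitation above which runoff occurs
    runoff_rate: float          # fraction of precipitation above the threshold lost as runoff
    evaporation_rate: float     # fraction of soil water evaporated per day


def step_soil_water(water, precip, soil):
    """Advance the soil water level (mm) by one day without crop water uptake."""
    # Assumed order: runoff, then evaporation, then drainage of the excess.
    runoff = soil.runoff_rate * max(0.0, precip - soil.runoff_threshold)
    water = water + precip - runoff
    water -= soil.evaporation_rate * water
    excess = max(0.0, water - soil.holding_capacity)
    water -= soil.drainage_rate * excess
    return water, water > soil.waterlogging_level


soil = SoilProfile(0.5, 180.0, 150.0, 0.5, 20.0, 0.5, 0.1)
dry_day = step_soil_water(soil.initial_water_ratio * soil.holding_capacity, 0.0, soil)
wet_day = step_soil_water(150.0, 100.0, soil)
```

Here `dry_day` shows pure evaporation from a half-full soil, while `wet_day` exercises runoff and drainage in the same step.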

Supplementary Note 3: Crop Model Function
Step 1: Initialization for d = 0 for all (t, c).
Step 2: Simulation for days before planting, when […]
According to the descriptive modeling framework stated in the Method Section, the phenology clock for maize is described in Equations (C.13) and (C.14). Equations (C.8)-(C.12) and (C.21)-(C.25) represent the soil water module without and with the consideration of crop water intake, respectively. The potential water uptake by maize is described in Equation (C.18). The radiation and daily biomass production modules can be explained by Equations […] [3]. Equation (C.6) calculates the saturated vapor pressure following the instructions in [4]. Equation (C.7) gives the vapor pressure deficit, which is used to calculate the water demand of the maize plant in the next step. Equations (C.8)-(C.12) update the soil water level by combining evaporation, runoff and drainage.
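As an illustration of the vapor pressure step, the sketch below uses the widely used Tetens-type approximation for saturated vapor pressure. Whether Equation (C.6) has exactly this form is an assumption, as are the function names and the choice of averaging the values at the daily maximum and minimum temperatures.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Saturated vapor pressure (kPa) at air temperature temp_c (deg C), Tetens-type form."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressure_deficit_kpa(tmax_c, tmin_c, actual_vp_kpa):
    """Daily vapor pressure deficit (kPa), floored at zero."""
    # Assumption: daily saturated vapor pressure is the mean of the values at Tmax and Tmin.
    saturated = 0.5 * (saturation_vapor_pressure_kpa(tmax_c)
                       + saturation_vapor_pressure_kpa(tmin_c))
    return max(0.0, saturated - actual_vp_kpa)


vpd_example = vapor_pressure_deficit_kpa(30.0, 18.0, 1.5)
```

A larger deficit implies a higher water demand for the same amount of biomass production.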
Equations (C.13) and (C.14) describe how growing degree days are accumulated, taking into account the possibility that the high temperature cutoff applies. Equation (C.15) shows how to generate an LAI value from a given leaf weight. Equation (C.16) calculates the maximum potential biomass that can be generated under ideal water supply conditions. Equations (C.17)-(C.19) indicate how the water deficit index is computed. The water supply is determined by the soil water level and the roots. Water demand is calculated as defined by APSIM [5]. Using the water deficit index, we obtain the actual daily biomass generated by the photosynthesis process in Equation (C.20). Equations (C.21)-(C.25) are similar to Equations (C.8)-(C.12) and are used to calculate the soil water level during maize growth. Equation (C.26) records the number of days on which waterlogging occurs. Equations (C.27) and (C.28) describe how the root and leaf grow using their shares of the produced biomass. In the reproductive stage, the root no longer grows, leaf senescence follows Equation (C.30), and the grain begins to accumulate biomass as described in Equation (C.29).
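The growing degree day and water-limitation steps can be sketched as follows. This is a minimal illustration assuming the common clipped-average GDD formula and a supply/demand ratio capped at one; the exact forms of Equations (C.13)-(C.14) and (C.17)-(C.20) may differ, and all function and parameter names are hypothetical.

```python
def daily_gdd(tmax, tmin, t_base, t_cutoff):
    """Growing degree days for one day, with temperatures clipped to [t_base, t_cutoff]."""
    tmax_adj = min(max(tmax, t_base), t_cutoff)
    tmin_adj = min(max(tmin, t_base), t_cutoff)
    return (tmax_adj + tmin_adj) / 2.0 - t_base


def water_deficit_index(supply, demand):
    """Fraction of the water demand that the soil and roots can supply, in [0, 1]."""
    if demand <= 0.0:
        return 1.0  # no demand means no water limitation
    return min(1.0, max(0.0, supply) / demand)


def actual_biomass(potential_biomass, supply, demand):
    """Scale the potential daily biomass by the water deficit index."""
    return potential_biomass * water_deficit_index(supply, demand)


hot_day_gdd = daily_gdd(35.0, 15.0, 10.0, 30.0)   # Tmax is clipped to the cutoff
stressed_biomass = actual_biomass(10.0, 2.0, 4.0)  # only half the demand is met
```

In this sketch the accumulated sum of `daily_gdd` values would drive the phenology clock, and a stage ends once the required growing degree days for that stage are reached.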
The data-driven crop model is a complex nonlinear optimization problem that is not readily solvable by standard machine learning algorithms. Herein, we present a heuristic algorithm that can efficiently find a high-quality solution (without guarantee of global optimality) for the soil profile (s, L) and the genetics profile (g) parameters for all year-county combinations. The strategy is to iteratively update one of these profiles at a time while keeping the other fixed, exploring a small neighborhood of the current value with different step sizes. As such, model (8)-(11) was solved as a simulation optimization problem. Detailed steps of the heuristic algorithm are explained as follows.
Step 2: Update $g^*_{t,c}$. For all $(t, c)$, randomly select a variable in $g^*_{t,c}$ and try increasing and decreasing its value with different step sizes. Evaluate the objective (8) for all new values of this variable. Update the incumbent solution $g^*_{t,c}$ with the new value that resulted in the lowest RMSE.
Terminate the algorithm if neither $(s^*, L^*)$ nor $g^*_{t,c}$ for any $(t, c)$ has been updated or the running time limit has been reached; otherwise go back to Step 1 for a new iteration.
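The alternating update scheme can be illustrated with a small sketch. It assumes that the soil update mirrors the genetics update described in Step 2, collapses all $(t, c)$ combinations into a single genetics dictionary, and replaces the running time limit with an iteration cap. The function names, the toy objective and the step sizes are hypothetical.

```python
import random


def improve_one_variable(block, evaluate, step_sizes, rng):
    """Perturb one randomly chosen variable up and down; keep the best strict improvement."""
    name = rng.choice(sorted(block))
    best_block, best_val = block, evaluate(block)
    for step in step_sizes:
        for sign in (1, -1):
            trial = dict(block)
            trial[name] = block[name] + sign * step
            val = evaluate(trial)
            if val < best_val:
                best_block, best_val = trial, val
    return best_block, best_block is not block


def alternating_search(soil, genetics, objective, step_sizes=(0.5, 0.1),
                       max_iters=1000, seed=0):
    """Alternate soil and genetics updates until neither improves (or the cap is hit)."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        soil, soil_updated = improve_one_variable(
            soil, lambda s: objective(s, genetics), step_sizes, rng)
        genetics, genetics_updated = improve_one_variable(
            genetics, lambda g: objective(soil, g), step_sizes, rng)
        if not (soil_updated or genetics_updated):
            break
    return soil, genetics, objective(soil, genetics)


def toy_objective(soil, genetics):
    """Stand-in for the RMSE objective; a real run would simulate the crop model."""
    return (soil["a"] - 1.0) ** 2 + (genetics["b"] + 2.0) ** 2


best_soil, best_genetics, best_value = alternating_search(
    {"a": 0.0}, {"b": 0.0}, toy_objective)
```

Because only strict improvements are accepted, the objective never increases between iterations, which matches the incumbent-solution logic of the steps above but gives no guarantee of global optimality.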